DETAILED ACTION



Allowable Subject Matter 



1. Claims 4, 5, 7-10, 13, 15, 16, 20, 21, 23, 25, 26, 28, 30, 31, and 33-35 are 
objected to as being dependent upon a rejected base claim, but would be allowable if 
rewritten in independent form including all of the limitations of the base claim and any 
intervening claims. 

2. The following is a statement of reasons for the indication of allowable subject 
matter: 

As to claims 4 and 20, Anderson (6,499,016) in combination with Li et al.
(6,397,181) does not teach or fairly suggest a lexicon selected based on the user speech
and a predefined heuristic relating to voice tags in combination with the media capture
device of claims 1 and 18.

As to claim 5, Anderson in combination with Li et al. does not teach or fairly suggest
a user is able to select a lexicon the media and navigation between lexicons in
combination with the media capture device of claim 1.

As to claim 7, Anderson in combination with Li et al. does not teach or fairly suggest
using sound similarity metrics to align an annotation with a spoken query in combination
with the media capture device of claim 1.

As to claim 8, Anderson in combination with Li et al. does not teach or fairly suggest
a lexicon editor adapted to supplement a lexicon based on annotation, letter to sound
rules, or user speech corresponding to spelled word input in combination with the media
capture device of claim 1.

As to claim 9, Anderson in combination with Li et al. does not teach or fairly suggest
a post processor having greater speech recognition capabilities than the media capture
device of claim 1.

As to claim 10, Anderson in combination with Li et al. does not teach or fairly
suggest an external data interface receptive of lexicon contents, and a lexicon editor
adapted to store the lexicon contents in the memory of the media capture device of
claim 1.

As to claim 13, Anderson in combination with Li et al. does not teach or fairly
suggest communicating focused lexica to said post-processor over a communications
network in combination with the media capture device of claim 11.

As to claim 15, Anderson in combination with Li et al. does not teach or fairly
suggest converting textual tags in combination with the media capture device of claim
11.

As to claim 16, Anderson in combination with Li et al. does not teach or fairly
suggest a device that is adapted to perform a relatively limited amount of speech
recognition on the annotation compared to an amount of speech recognition performed
by the post-processor in combination with the media capture device of claim 11.

As to claim 21, Anderson in combination with Li et al. does not teach selecting a
lexicon based on a user interface in combination with the media capture device of claim
18.

As to claim 23, Anderson in combination with Li et al. does not teach or fairly
suggest selecting a lexicon based on user identity relating to the media capture device
in combination with the media capture device of claim 18.

As to claim 25, Anderson in combination with Li et al. does not teach or fairly
suggest a retrieval mode using sound similarity metrics to align an annotation with a
spoken query in combination with the media capture device of claim 18.

As to claim 26, Anderson in combination with Li et al. does not teach or fairly
suggest supplementing a lexicon stored in device memory based on an annotation,
letter to sound rules, and user speech corresponding to spelled word input received and
recognized during a lexicon edit mode of the device in combination with the media
capture device of claim 18.

As to claim 28, Anderson in combination with Li et al. does not teach or fairly
suggest transferring annotations from the device to a post processor in combination with
the media capture device of claim 18.

As to claim 31, Anderson in combination with Li et al. does not teach or fairly
suggest transferring focused lexica from a source of predefined, focused lexica to the
device in combination with the media capture device of claim 18.

As to claim 33, Anderson in combination with Li et al. does not teach or fairly
suggest converting textual tags associated with captured media to alternative textual
tags based on predetermined criteria relating to a media capture activity in combination
with the media capture device of claim 18.

As to claim 34, Anderson in combination with Li et al. does not teach or fairly
suggest clustering textual tags on semantic similarity measures in combination with the
media capture device of claim 18.

As to claim 35, Anderson in combination with Li et al. does not teach clustering
textual tags on acoustic similarity measures in combination with the media capture
device of claim 18.

Claims 29 and 30 would also be allowable as they further limit allowable subject 
matter. 

Claim Rejections - 35 USC § 102 

1. The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that
form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless - 

(a) the invention was known or used by others in this country, or patented or described in a printed 
publication in this or a foreign country, before the invention thereof by the applicant for a patent. 

2. Claims 11, 17, 18, 22, and 24 are rejected under 35 U.S.C. 102(a) as being
anticipated by Anderson (6,499,016). 

As to claim 11, Anderson teaches:

a portable media capture device adapted to capture media, (digital camera, col. 2,
lines 55-57);

to receive user speech in close temporal relation to a media capture activity
(recording category-specific voice annotations on the camera at the time of capture,
col. 3, lines 10-14);

adapted to annotate captured media with a sample of the user speech that is
suitable for input to a speech recognizer based on close temporal relation between
receipt of the user speech and capture of the captured media, (each of the image files
the user identified for voice recognition is processed by translating each of the voice
annotations in the image file into a text annotation (col. 5, lines 31-34). It would be
inherent that, since each of the images is tagged with voice annotations within the
camera and each of the voice tags is then translated into a text annotation, each of
the images would be tagged with a text annotation. Furthermore, it would be inherent
that, for a system that creates the text annotations once a connection to the internet is
established and that contains wireless capabilities (col. 5, lines 10-12), the
text annotations could be applied in close temporal relation);

a post processor adapted to receive annotations from the device, perform speech
recognition on the annotations, and tag related captured media with text generated
during speech recognition performed on the annotations, (a categorizing system that is
able to apply categories along with voice annotations to the image at the
time of capture, col. 4, lines 16-24).

As to claim 17, Anderson teaches the post-processor is receptive of captured
media from said device, and is adapted to organize the captured media according to at
least one of annotations and textual tags associated with the captured media, including
clustering at least one of annotations and textual tags based on at least one of acoustic
similarity measures and semantic similarity measures, (photo albums are created by
grouping together all the images with text annotations having matching keywords, col. 6,
lines 39-43).

As to claim 18, Anderson teaches: 

capturing media with the media capture device during a media capture activity
conducted by a user of the device, (digital camera for capturing images, col. 2,
lines 55-57);

receiving user speech via an audio input of the device in close temporal relation
to the media capture activity, (recording category-specific voice annotations on the
camera at the time of capture (col. 3, lines 10-14); each of the image files the user
identified for voice recognition is processed by translating each of the voice
annotations in the image file into a text annotation (col. 5, lines 31-34). It would be
inherent that, since each of the images is tagged with voice annotations within the
camera and each of the voice tags is then translated into a text annotation, each of
the images would be tagged with a text annotation. Furthermore, it would be inherent
that, for a system that creates the text annotations once a connection to the internet is
established and that contains wireless capabilities (col. 5, lines 10-12), the
text annotations could be applied in close temporal relation);

annotating captured media by storing the captured media in memory of the
device in association with a sample of the user speech that is suitable for input to a
speech recognizer, (a categorizing system that is able to apply categories along with
voice annotations to the image at the time of capture, col. 4, lines 16-24);

recognizing the user speech with a speech recognizer of the device employing a 
focused speech recognition lexicon relating to the media capture activity, (voice 
annotations are entered for the image captures, where the voice annotations relate to 
categories such as location, history, caption, and occasion, with each category related 
to the process of organizing images, col. 4, lines 15-24); 

tagging captured media with recognition text generated during recognition of the 
user speech by storing the captured media in memory of the device in association with 
the recognition text, (the voice annotations are translated into text annotations, and 
stored along with the captured image, col. 5, lines 22-27). 

As to claim 22, Anderson teaches receiving a user identity, wherein said step of 
recognizing the user speech is based on the user identity (recognizing the voice 
annotations based on the user logged into the system, col. 5, lines 20-30). 

As to claim 24, Anderson teaches retrieving captured media from memory of the
device by matching a tag of the captured media to recognition text generated from user
speech received and recognized during a retrieval mode of the device (the user is able
to select all the images whose tags correspond to certain keywords, such as a search
for pictures taken at the "Beach" on "Vacation", col. 6, lines 39-42).

Claim Rejections - 35 USC § 103 

3. The following is a quotation of 35 U.S.C. 103(a) which forms the basis for all 
obviousness rejections set forth in this Office action: 

(a) A patent may not be obtained though the invention is not identically disclosed or described as set 
forth in section 102 of this title, if the differences between the subject matter sought to be patented and 
the prior art are such that the subject matter as a whole would have been obvious at the time the 
invention was made to a person having ordinary skill in the art to which said subject matter pertains. 
Patentability shall not be negatived by the manner in which the invention was made. 

4. Claims 1-3 and 6 are rejected under 35 U.S.C. 103(a) as being unpatentable 
over Anderson, in view of Li et al. (6,397,181). 

As to claim 1, Anderson teaches:

a media capture mechanism, (digital camera, col. 2, lines 55-57); 

an audio input receptive of user speech relating to a media capture activity in 
close temporal relation to the media capture activity, (recording category-specific voice 
annotations on the camera at the time of capture, col. 3, lines 10-14); 

a media tagger adapted to tag captured media with text generated by said
speech recognizer based on close temporal relation between receipt of recognized user
speech and capture of the captured media, (each of the image files the user identified
for voice recognition is processed by translating each of the voice annotations in the
image file into a text annotation (col. 5, lines 31-34). It would be inherent that, since
each of the images is tagged with voice annotations within the camera and each of
the voice tags is then translated into a text annotation, each of the images would be
tagged with a text annotation. Furthermore, it would be inherent that, for a system that
creates the text annotations once a connection to the internet is established and that
contains wireless capabilities (col. 5, lines 10-12), the text annotations could be
applied in close temporal relation);

a media annotator adapted to annotate the captured media with a sample of the
user speech that is suitable for input to a speech recognizer based on close temporal
relation between receipt of the user speech and capture of the captured media, (a
categorizing system that is able to apply categories along with voice annotations to the
image at the time of capture, col. 4, lines 16-24).

Anderson does not teach: 

a plurality of focused speech recognition lexica respectively relating to media 
capture activities, nor 

a speech recognizer adapted to recognize the user speech based on a selected 
one of the focused speech recognition lexica. 

However, 

Li et al. teach: 

a plurality of focused speech recognition lexica respectively relating to media
capture activities (a sample language defined in Backus-Naur Form (BNF), where the
BNF can be customized to a specific topic or a specific speaker/narrator, col. 4, lines
54-67). It would be obvious to one of ordinary skill in the art at the time of the invention
that since the sample language is customizable, there would be the ability to have a
plurality of focused speech recognition languages available to the user to increase the
flexibility and improve the method of indexing the media (col. 1, lines 20-21); and

a speech recognizer adapted to recognize the user speech based on a selected
one of the focused speech recognition lexica, (a sample language defined in
Backus-Naur Form (BNF), where the BNF can be customized to a specific topic or a specific
speaker/narrator, col. 4, lines 54-67).

Therefore it would have been obvious to one of ordinary skill in the art at the time 
of the invention to combine the media capture method of Anderson with the language 
customization of Li et al. to create an improved system of indexing and retrieving media 
content, as taught by Li et al. (col. 1, lines 19-21). 

As to claim 2, Anderson teaches an input receptive of a user identity, wherein 
said speech recognizer is adapted to recognize user speech based on the user identity, 
(recognizing the voice annotations based on the user logged into the system, col. 5, 
lines 20-30). 

As to claim 3, Anderson suggests the speech recognizer is adapted to employ
focused lexica based on the user identity (recognizing the voice annotations based on
the user logged into the system (col. 5, lines 20-30). It would be obvious to one of
ordinary skill in the art at the time of the invention that
the voice processing service identifies the user before voice annotations are translated,
and lexica based on that user would be loaded so the voice processing service would
be familiar with the vocabulary of the user, increasing the ability of the voice processing
service to create correct translations of the user's speech).

As to claim 6, Anderson teaches a media retrieval mechanism adapted to
retrieve captured media from memory of the device by matching a tag of the captured
media to recognition text generated from user speech received and recognized during a
retrieval mode of the device, (the user is able to select all the images whose tags
correspond to certain keywords, such as a search for pictures taken at the
"Beach" on "Vacation", col. 6, lines 39-42).

5. Claims 12, 14, 19, 27, and 32 are rejected under 35 U.S.C. 103(a) as being 
unpatentable over Anderson as applied to claims 11 and 18 above, and further in view 
of Li et al.

As to claim 12, Anderson does not teach a source of predefined, focused lexica 
relating to media capture activities and adapted to communicate focused lexica to said 
media capture device according to device type over a communication network. 

However, Li et al. teach a plurality of focused speech recognition lexica
respectively relating to media capture activities (a sample language defined in
Backus-Naur Form (BNF), where the BNF can be customized to a specific topic or a specific
speaker/narrator, col. 4, lines 54-67). It would be obvious to one of ordinary skill in the
art at the time of the invention that since the sample language is customizable, there
would be the ability to have a plurality of focused speech recognition languages
available. Furthermore, it would be obvious to one of ordinary skill in the art at the time
of the invention that since the user is able to customize the sample language, the user
would customize the sample language depending on the necessary elements of the media
device, to increase the flexibility and improve the method of indexing the media (col. 1,
lines 20-21).

Therefore it would have been obvious to one of ordinary skill in the art at the time 
of the invention to combine the media capture method of Anderson with the language 
customization of Li et al. to create an improved system of indexing and retrieving media 
content, as taught by Li et al. (col. 1, lines 19-21). 

As to claim 14, Anderson does not teach a lexicon editor provided to at least one 
of the device and the post processor and adapted to customize a focused lexicon for a 
user of the device. 

However, Li et al. teach the system is able to customize the sample language, by
defining the language to a specific topic, or a specific speaker/narrator, (col. 4,
lines 54-67).

Therefore it would have been obvious to one of ordinary skill in the art at the time 
of the invention to combine the media capture method of Anderson with the language 
customization of Li et al. to create an improved system of indexing and retrieving media
content, as taught by Li et al. (col. 1, lines 19-21). 

As to claim 19, Anderson does not teach selecting a focused speech recognition 
lexicon relating to the media capture activity from a plurality of focused lexica relating to 
media capture activities that are stored in memory of the device. 

Li et al. teach a sample language defined in Backus-Naur Form (BNF), where the
BNF can be customized to a specific topic or a specific speaker/narrator, (col. 4, lines
54-67). It would be obvious to one of ordinary skill in the art at the time of the invention
that since the sample language is customizable, there would be the ability to have a
plurality of focused speech recognition languages available to the user to increase the
flexibility and improve the method of indexing the media, (col. 1, lines 20-21).

As to claim 27, Anderson suggests lexicon contents and storing the lexicon 
contents in device memory, (a lexicon with commands/categories for categorizing the 
images, (col. 4, lines 8-11). It would be necessary that the lexicon be loaded and stored 
in the memory of the device, for future use). 

As to claim 32, Anderson does not teach customizing a focused lexicon for a 
user of the device. 

However, Li et al. teach customizing the language by customizing the topic or a
specific speaker/narrator, (col. 4, lines 63-67).

Therefore it would have been obvious to one of ordinary skill in the art at the time
of the invention to combine the media capture method of Anderson with the language 
customization of Li et al. to create an improved system of indexing and retrieving media 
content, as taught by Li et al. (col. 1, lines 19-21). 

Conclusion 

6. The prior art made of record and not relied upon is considered pertinent to
applicant's disclosure: Kanevsky et al. (6,434,520), Charlesworth et al. (2002/0022960),
and Liaguno et al. (5,729,741).

Kanevsky et al. teach a system and method for indexing segments of 
audio/multimedia files and data streams for storage in a database. 

Charlesworth et al. teach determining a sequence of sub-word units 
representative of at least two words output by a word recognition unit in response to a 
common input word to be recognized. 

Liaguno et al. teach a media image information storage and retrieval system that
processes information supplied by different types of media.

7. Any inquiry concerning this communication or earlier communications from the
examiner should be directed to Thomas E Shortledge whose telephone number is
(571) 272-7612. The examiner can normally be reached on M-F 8:00 - 4:30.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's
supervisor, Talivaldis Smits, can be reached on (571) 272-7628. The fax phone number
for the organization where this application or proceeding is assigned is 703-872-9306.

Information regarding the status of an application may be obtained from the 
Patent Application Information Retrieval (PAIR) system. Status information for 
published applications may be obtained from either Private PAIR or Public PAIR. 
Status information for unpublished applications is available through Private PAIR only. 
For more information about the PAIR system, see http://pair-direct.uspto.gov. Should 
you have questions on access to the Private PAIR system, contact the Electronic 
Business Center (EBC) at 866-217-9197 (toll-free). 

TS 

05/10/2005 




